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Chapter 1 

Compressed Sensing 



1.1 What is Compressed Sensing? 

Let us consider the following image which will be used extensively throughout 
this introduction. 

To insert this image into the report I pointed my editor to the file called 
Lena, jpg which was automatically added to the text. Let us take a closer view 
at the file. The size of the image is 512 x 512 pixels. Each pixel contains an 
integer number in the range between and 255 hence, there is one byte (8 bits) 
of storage reserved to keep the value of each pixel. Simple calculations show 
that the file should occupy at least 512 x 512 x 1 = 262144 bytes, however, the 
size of the file is only 44791 bytes. To understand where the difference came 
from let us review the way this file arrived to my computer. First, the person 
portrayed in the image was photographed by an analogous camera. Then, the 
film or the photograph was scanned by a digital scanner to form a digital image 
that is already suitable for use by the computer. Note that the size of this image 




Figure 1.1: Lena 
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Figure 1.2: 2D DCT on 64 x 64 patches 



is512x512, assuming that the original film had dimensions of 6 x 6 centimeters 
we conclude that the resolution of the scanning (sampling) process was about 
85 samples per centimeter (in both directions) . Sampling rate is important if we 
want to reconstruct the original analog image. Such a reconstruction is possible 
if we sample with the sufficiently high rate, as defined by the Nyquist- Shannon 
theorem. However, we shall return to our image which currently requires 262144 
bytes of storage. In the next step it undergoes a process called compression. 
One of the most common compression techniques for images is known under 
the name JPEG compression. We describe here a very primitive version of this 
compression method which, nevertheless, contains the most important details. 
The images is splitted into square blocks and each block undergoes the two- 



dimensional discrete cosine transfrom (DCT), as depicted in Figure 1.2 

Note that the most energy (reddish colors) is concentrated in a relatively 
small fraction of the DCT coefficients, thus the coefficients that are close to 
zero are discarded and we end up with a significantly smaller number of the 
DCT coefficients that represent the original image without significant visual 
artifacts. The original JPEG compression does not stops here; it has a number 
of further steps. Those are not important for our discussion. I would like to 
draw your attention to the two important effects we achieved by discarding some 
of the DCT coefficients. First, we, obviously, lost some information about the 
original image, thus the compression method is lossy. Second, we used only 
a small number of the original coefficients which means that we switched to 
a sparse representation of the original patches in the DCT base. The latter 
property is of paramount importance as we shall see in the sequel. 

It turns out that sampling signals with high resolution may be wastfull in 
case that the signals of interset are compressible or, alternatively, are known to 
have a sparse representation in some basis. Compressed Sensing is an emerging 
framework dealing with efficient ways to sample such signals. 
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1.2 Formal definition 



Consider a signal x € W 1 that is known to have a sparse representation over a 
certain dictionary D g M™ xfc . i.e,. 

x = D6, 

where \\0\\ < S <C n. The classical sampling methods would require n samples 
per signal. The Compressed Sensting, on the other hand, suggests to replace 
these n direct samples with m inderect ones by measuring linear projections of 
x defined by a projection matrix P € ]R mxn 

y = Px, 

such that S < m <C n. That means that instaed of sensing n elements of 
the original signal we can sense it directly in compressed form, by sampling a 
relatively small number of linear projections 

Vi = (Pi,%) ■ 

The main question with whether the original signal can be reconstructed from 
this insufficient number of samples. Surprisingly, the answer is yes. In this 
paper we are going to review the methods to design efficient projections, but, 
meanwhile we present some of the notorious projections used in practice in 
Figure [T31 



3 




(b) line projection (tomography) 




(c) sinusoid (MRI) 




(d) random projection 

igure 1.3: Graphical representation of common projections. 
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1.3 Designing optimal projections 



In this section we consider the case when the dictionary D is fixed while the 
projection matrix P can be arbitrary. Hence, we would like to design P that 
will suit us in the best way. Before we proceed with the design let us define an 
important property that characterizes a dictionary D. 

Definition 1. For a dictionary D, its mutual coherence is defined as the largest 
absolute and normalized inner product between the different columns of D 

" (D) = , vr¥mlv 

where denotes the i-th column of D. 

An alternative definition of the mutual coherence is via the largest (by mag- 
nitude) off-diagonal value of the Gram matrix 

G = D T D, 

where D denotes the "normalized" version of D, i.e., the columns of D are scaled 
to unity (in I2 norm). 

The mutual coherence provides a quantitative measure of how far are the 
columns of D from being orthogonal to each other. It also has a strong influence 
on the theoretical analysis of worst case scenarios, as implied by the following 
theorem. 

Theorem 2. Given a signal x = DO such that 

\\eh<\(i+ ' 



2 V fi(D) 
then the following three results hold: 

1. The vector 6 is necessarily the sparsest one to describe x, i.e., it is the 
unique solution of 

min||0|| o s.t. x = DO. 

9 

2. The Basis Pursuit (BP) algorithm that approximates the exact solution 
by solving the linear problem 

min||0||i s.t. x = D6, 

9 

is guaranteed to find the exact solution 9. 

3. The Orthogonal Matching Pursuit (OMP) that is a greed approximate 
algorithm which iteratively solve the least squares problem 

\\x-D8\\l 

for only one component of 6 per iteration is also guaranteed to obtain the 
unique solution 6. 
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With the aforementioned properties of \i there is an obvious reason to design 
the projection matrix P in a way that minimizes the mutual coherence /i(PD). 
Note the treatment of the mutual coherence has addressed only the worst case 
behavior so far. This analysis does not provide any insight into the "average" 
behavior of the pursuit algorithms. It is possible that relaxing the requirement 
on the maximal \i value, that, on the other hand, leads to decrease in the average 
/j, may lead to better average performance of the pursuit algorithms. Following 
this idea Elad introduced another measure which is defined as follows. 

Definition 3. For a dictionary D, its t-average mutual coherence is defined 
as the average of all absolute and normalized inner products between different 
columns in D (denoted as 5^) that are above t. Formally it reads 

For t — 0, we obtain a simple average of the absolute entries of G. As the value 
of t grows, we obtain that Ht{D) grows and approaches n(D) from below. It is 
also evident from the definition that m(D) > t. 

1.3.1 Elad's method 

Along with the definition given above, Elad, who followed the approach of 
Dhillon et ai, also suggested and iterative algorithm that that tries to mini- 
mize the value fj, t (PD) with respect to P, assuming that both the dictionary D 
and parameter t are given and fixed the algorithm may be formalized as follows. 
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Algorithm 1.1 Elad's algorithm for the projection matrix P optimization. 



Objective: Minimize /i t (PD) with respect to P. 
Input: Use the following parameters: 

• t or t% - fixed or relative threshold, 

• D - the dictionary 

• p - number of measurements 

• 7 - down-scaling factor 

• Iter - number of iterations 

Initialization: set Pq £ R mx ™ to an arbitrary matrix 
Loop: Set q = 

(iteration counter) and repeat Iter iterations: 

1. Normalize: Normalize the columns of P q D and 
obtain the effective dictionary D q . 

1. Compute Gram Matrix: G q = DjD q . 

3. Set Threshold: If mode of operation is fixed, use 
t as threshold. Otherwise, chose t such that t% of 
the off -diagonal elements in G q is above it. 

4. Shrink: Update the Gram matrix and obtain G q by 



8. Advance: Set q = q + l. 



As we can see from the algorithm, it allows a slightly different definition of 
t. Instead of being constant it varies from iteration to iteration in the way that 
only t% of the off-diagonal entries of the Gram matrix G are altered during the 
current iteration. 
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Note that the shrinkage function used in the algorithm 



!79ij \9%j\>t 
ftj 7*>l9yl 

tries to preserve the ordering of the absolute entries in the Gram matrix. Thus, 
the entries whose magnitude lies between t and "ft are "shrunk" by a smaller 
amount, as shown in Equation This approach leads to a better distribution 
of the off-diagonal elements' values, unfortunately, it also creates some large 
values, that were not present in the original matrix. Large off-diagonal values 
ruin completely the worst-case guarantees of the pursuit methods. The methods 
presented in the next two subsections has no such undesired property. 

1.3.2 Duarte-Carvajalino & Sapiro's method 

Unlike the previous method, this one is non-iterative or, more precisely, the 
number of iterations is very small and constant. It was suggested by Duarte- 
Carvajalino and Sapiro. Their approach is as follows. Consider the Gram matrix 
of the effective dictionary PD: 

G = D T P T PD, (1.2) 

which should be as close to the identity matrix as possible, i.e., we would like 
to find such P that gives approximately 

D TpTpT DKi j ( 13 ) 

By multiplying both sides of the previous expression by D on the left and D T 
on the right we obtain 

DD T P T P T DD T w DD T . (1.4) 

Now, let us consider the singular value decomposition of DD T which is known, 
of course, 

DD T = VAV T , 

then Equation |1.4| becomes 

VAV T P T PVAV T VAV T , (1.5) 

which is equivalent to 

AV T P T PVA& A. (1.6) 

By denoting T — PV we finally formulate our problem minimization of the 
following functional with respect to T 

||A- Ar T rA|||. (1.7) 
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Note that if D is an orthonormal basis and m — k, then the above equation 
would have an exact solution that produces zero error, i.e., Y = A -1 / 2 . However, 
since the dictionary is over-complete and m -C k we have to find an approximate 
solution that minimizes the error in Equation |1.7| Note that this time the error 
will not be equal to zero, in general. The solution algorithm, as suggested by 
the authors is as follows. 

Let Ai, A2, . . . , A„ be the singular values of the known diagon al m atrix A, 
ordered in decreasing order, and T = [t± . . . r m ] T . Then, Equation 1.7 becomes 



i=0 



(1.8) 



where u,; = [Ajr^i . . . X n Ti^ n ) T , or equivalently, 



(1.9) 



Let us define E = A — £\ v i v I '> Ej = A — v i v 1 1 anc ^ ^ ^3 = UjAjUj be 
the singular value decomposition of Ej. Then, Equation 1.9 becomes 



U 3 II F 



k=l 



If we set Vj 



being the larges singular value of Ej and u± t j its 



corresponding eigenvector, then the largest error component in E is eliminated. 
Replacing Vj back in term of r 3 - (the rows of the matrix we are optimizing for, 
T = PV), 

[AiTj-,1 . . . \ n Tj tn ] T = x ,. (1-10) 

Since the matrix A is in genereal not full-rank, then, for some r > 0, A„_ r +i , . . . , A If 
will be zero, and we can only update the r^i, . . . , Tj t n-r components of Tj. This 
derivation forms the basis of the algorithm for optimizing (1.7 1. The formal 
algorithm is given below. 
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Algorithm 1.2 Duarte-Carvajalino & Sapiro's algorithm for the projection 
matrix P optimization. 

Objective: Minimize ||A — 

A(PV) T (PV)A\\ 2 F with respect to P. 

Input: Use the following parameters: 

• D - the dictionary 
Initialization : 

1. find the singular value decomposition DD T — VAV T 

2. set Pq G Jj mxn to an arbitrary random matrix 

3. set T = P D 
Loop : Set q = 

(iteration counter) and repeat m iterations: 

1 . Compute E q . 

2. Find the largest singular value and the 
corresponding eigenvector of E q , ^i.^and Ui >q . 



3. Use (1.10 1 to update the first r components of T q 
(thereby updating D . 

4. Compute the optimal P = TV T . 
EndLoop 



1.3.3 Our method 

To overcome this drawback of the Elad's method we propose another algorithm 
that models the problem in a different way. We formulate the following feasibil- 
ity problem. Find a symmetric matrix G € M fexfc subject to the two constraints: 
first, the rank of G is m < k; second, the off-diagonal entries must obey \gij \ < t, 
while the diagonal entries must be greater than or equal to unity: \gn\ > 1. To 
solve the feasibility problem we apply the method of alternating projections, 
where the current estimate G q is "projected" onto the constraint sets alterna- 
tively, as described in the algorithm shown in Figure [T3| 
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Algorithm 1.3 Our algorithm for the projection matrix P optimization. 



Objective: Minimize fi t {PD) with respect to P. 
Input: Use the following parameters: 

• t - fixed threshold, 

• D - the dictionary 

mm- number of measurements (rows of P) 

• Iter - number of iterations 
Initialization: set Go £ 

jgifcxfc tQ ^ arD itrary random symmetric matrix 
Loop : Set q = 

(iteration counter) and repeat Iter iterations: 

1. Project onto the convex set: Update the Gram 
matrix and obtain G q by 

(a) updating the off -diagonal elements (i ^ j) : 

fa = t ■ sign(^) if \gij\ > t 

(b) updating the diagonal elements: 

9u = 1 if 9u < 1 

2. Project onto the non- convex set: force rank of G q 
to be equal to m and G ~ PD . 

3. Advance: Set G q+ i — G q ; q = q + l. 

EndLoop 
Return: 

1 . Normalize G : 

G = dia s ( i a- 1 m \ ) * G i * dia § ( iv x= m \ ) ; 

\V dla s( G 9)/ \v dla s( G 9)/ 

2. Find P: solve D T P T PD = G for P. 



The main difference between the two algorithm is the "shrinkage" methods, 
which are shown graphically in Figur^L4j and absence of additional parameter 
7 in our method. Our method does not produce large off-diagonal elements, in 
fact, it succeeds to remove all elements that are larger than the given threshold 
t, provided that the target was not too aggressive. Effect of applying both 
algorithm to the distribution of the magnitude of the off-diagonal entries is 
demonstrated in Figure |1.5| In this experiment we used a random dictionary 
D of size 200 x 400. Its entries we randomly drawn from a normal distribution 
with zero mean and unity variance. Number of the projections were set to 30, 
i.e., the size of the projection matrix was 30 x 200. Effective threshold we used 




-1 -0.8 -0.6 -0.4 -0.2 0.2 0.4 0.6 0.8 1 

Input value 



Figure 1.4: Shrinkage operaion in Elad's and our methods (f = 0.5, 7 = 0.6) 



was 26% for the Elad's algorithm and 0.26 for our method. The parameter 7 
was set to 0.6 in the Elad's algorithm. 
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(a) Histogram of the absolute values of the off-diagonal entries of the Gram matrix G before opti- 
mization. 
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(d) Our method 



Figure 1.5: Histogram of the absolute values of the off-diagonal entries of the 
Gram matrix G before the optimization and afterwards. 



After the optimization, the projection matrix P is both cases does not reveal 



any specific structure, as we can see in Figure (1.6 1 



-»■ . v: . i"^ JI ■ ■■■■ ji /"^■'■■r- 



(a) Original 




(b) After Duarte-Carvajalino & Sapiro's method 
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(c) After Elad's method 




(d) After our method 

Figure 1.6: Projection matrix after optimization (a subset of 100 columns). 
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(a) Initial random projection matrix P 
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(b) After Duarte-Carvajalino & Sapiro's method 
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(c) After Elad's method 
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(d) After our method 

Figure 1.7: Histogram of the values of the projection matrix P before the opti- 
mization and afterwards. 



It is also interesting to look at the distribution of the values in the projection 
matrix before the optimization and afterwards. It seems that the values are still 
drawn form the Gaussian distribution with zero mean, but the variance is much 
lower. Our experiments indicate that this is a typical result, that does not 
depend on the initial guess. 

If we consider the final distribution of the off-diagonal entries of the Gram 
matrix we will find out that it is quite different, as is evident from Figure [TT5] 
The correlation between the optimized projections matrix P is yet to be the- 
oretically analyzed, therefore, to evaluate the performance of the compressed 
sensing with and without the optimized projections we performed the following 
tests: 

Stage 1: Generate data: choose a dictionary D <= M. nxk and synthesize N test 
signals {xi}^L 1 by generating N sparse vectors {9i}^L 1 of length k each, 
and computing x\ = D9i for all i. All representations are built with 
the same low cardinality ||cj|| = S. 

Stage 2: Random projections: for a chosen number of projections m generate a 
random projection matrix P £ K mxn anc j apply it to the signals to obtain 
Hi = Pxi. Compute the effective dictionary D = PD. 

Stage 3: Performance tests: apply the BP and OMP to reconstruct the signals 
by approximate the solution of 

§i = argmin ||0|| o s.t. y t = D9,. 
o 

The result obtained by the pursuit algorithms is compared against the true 
solution by evaluating the error \\8i — 0i\\. Errors above some threshold 
are considered as a reconstruction failure. 

Stage 4: Optimized projections: the evaluations above are repeated for the 
projection matrix as returned by the optimization algorithms. 

We have followed the above stages to evaluate how the influence of the projection 
matrix optimization varies along with the cardinality of the true solution. In our 
experiment we used a random dictionary of size 200 x 400, i.e., n = 200, k = 400. 
For different of S (S = 1, 2, ... , 10) we generated N = 10000 sparse vectors of 
length k = 400 with S non-zero values in each. The locations of the non-zero 
components were chosen at random and populated with i.i.d. zero-mean and 
unit variance Gaussian values. These sparse vectors were used to create the 
example signals that were used in the evaluation of the CS performance. The 
results for the OMP and BP are depicted in Figures |1.8| and |1.9| respectively. 



1G 
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S - cardinality of the input signals 

Figure 1.8: CS reconstruction relative errors as a function of the signals' cardi- 
nality S. Using OMP method. 
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S - cardinality of the input signals 

Figure 1.9: CS reconstruction relative errors as a function of the signals' cardi- 
nality S. Using BP method. 



1.4 Simultaneous optimization of the projection 
matrix and the dictionary 

The idea of designing an optimal projection matrix can be taken further if we 
allow the dictionary D to be changed as well. Let us recall the K-SVD algorithm 
that is is used for dictionary training. Given an n x p matrix X — [x% , x% , . . . , x p ] 
of p training images of length n pixels each, used to train on overcompletc 
dictionary D of size n x k, with p^> k and k > n. The objective of the K-SVD 
algorithm is to solve, for a given sparsity level, 

minllX-Deili s.t. ||6U| < S, (1.11) 

D,& 

where @ = [9\ , 02 , ■ ■ ■ , 6 P ] , and Qi is the sparse vector of coefficients representing 
the i-th image in terms of columns of the dictionary D — [d\, d®, ■ • ■ , <ffc]. K- 
SVD is an iterative algorithm that progressively improves the functional in 
Equation ( |1.11[ ) as described next. First, the algorithm freezes the current 
dictionary D and solves for the coefficients' matrix using one of the pursuit 
algorithms. Next, it assumes that the matrises and D are fixed except for 
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one column of D: dj. Finding the best dj is done by re-writing the functional 
in Equation ( |1.11| as follows 



\X-DQ\ 



X 



dj9 J T 



E; 



(1.12) 



where 6° T denotes the j-th row of 0. It is tempting to use the SVD to solve for 
new dj and 9 J T , however, that will lead to a dense 8 J T which is absolutely unde- 
sirable. The K-SVD algorithm overcomes this difficulty by using a small subset 
of columns of X that use the atom dj in their representation. Consequently, the 
only updated entries of J T are those that are non-zero at the moment. Hence, 
the algorithm does not increase the density of the coefficients matrix 0. After 
running over all columns of D the algorithm returns to the first step and cycles 
until convergence, or maximum number of iterations. There exists experimen- 
tal evidence that the K-SVD algorithm is sensitive to the initial guess, however, 
this question is not addressed here. 

The Coupled-KSVD algorithm suggested by Duarte-Carvajalino and Sapiro 
extends the KSVD for simultaneous training of the projection matrix P and 
dictionary D with images available from a dataset. They define the following 
objective function 



min X\\X-De\\i 
p.D.e r 



\Y-PDQ\\p s.t. 



7. 



< S, 



(1.13) 



where the scalar A € [0, 1] controls the relative weight of the two terms and Y 
are linear samples given by 

Y = PX + N, (1.14) 

where N represents an additive noise introduced by the sensing system. Let us 
denote 

AX \ „ r /A/ N 



Z 



Y 



, W 



P 



(1.15) 



Then, Equation (1.131 can be re- written as 



mm Z - D6 
p.D.e 



s.t. ||^|| < S, 



where D = WD. Now, in exactly the same manner as we did in the K-SVD we 
run over all columns of D one by one and find simultaneous solution for dj and 



9 j,. Then, we substitute back 



dj = ( ) (/, 



(1.16) 



At this point, we have m+n equations and n unknowns. The solution is obtained 
by solving the overdetermined system in the sense of the least squares, thus the 
unknown column dj can be found by 



d J = (x 2 i + p T py 1 {xi p T )d, 



(1.17) 
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Algorithm 1.4 Coupled KSVD. 

Objective: Minimize A ||X — _D0|| F + 
\\Y-PDef F s.t. ||0,|| o < 
S with respect to P,D, and 0. 
Input: Use the following parameters: 

• X ,Y - the training data and its projections 

• A - weight of the first term (0 < A < 1) 

Initialization : 

1. set initial dictionary Dq £ W xn to an arbitrary 
random matrix 

Loop : Set q = 

(iteration counter) and repeat until convergence: 

1. For fixed D q compute P q using an algorithm from the 
previous section. 

2. For fixed D q and P q compute Q q using a pursuit 
algorithm, e.g. , OMP. 

3. Using update P q , D q , and Q q by the K-SVD, as 
described in Equations |1 . 16| , [T7l7] , and |1 . 18| . 

EndLoop 



Finally, the norm of dj is adjusted to unity (of course, this step must be accom- 
panied by adjusting the row 9 3 T ) to keep the product djOip without any change), 

i.e., 

^ = ^||^.|| 2l di^dj/Wdih (1-18) 

After we finished the update of the dictionary D and the coefficient matrix 
0, assuming fixed P, we can update the projection matrix, with the methods 
described in the previous section, i.e., learning the projection matrix from the 
just updated dictionary D. Then, we repeat the algorithm until convergence. 
Hence, the whole algorithm is formalized in Algorithm [L4j Experimental results 
of this approach are available in the original paper. 

1.5 Conclusions 

We presented several algorithms for optimization of the projection matrix used 
in the Compressed Sensing. Experimental results indicate that our method 
performs better than the the other two. Simultaneous method of optimiza- 
tion of the projection matrix and the sparsifying dictionary as suggested by 
Duarte-Carvajalino and Sapiro was presented in a descriptive manner without 
experimental results. 
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